65 research outputs found

    Sorting by weighted inversions considering length and symmetry

    Large-scale mutational events that occur when stretches of DNA sequence move throughout genomes are called genome rearrangements. In bacteria, inversions are among the most frequently observed rearrangements. Recent research shows that, in some bacterial families, inversions are biased in favor of symmetry. In addition, several results suggest that short-segment inversions are more frequent in the evolution of microbial genomes. Although the symmetry and length of the reversed segments seem very important, they have not been considered together in any problem in the genome rearrangement field. Here, we define the problem of sorting genomes (or permutations) using inversions whose costs are assigned based on their lengths and asymmetries. We consider two formulations of the same problem, depending on whether the orientation of the genes is known. We present several procedures and assess their performance on a large set of more than 4.4 × 10^9 permutations. The ideas presented in this paper provide insights for solving the problem and set the stage for a proper theoretical analysis.

    Sorting Circular Permutations by Super Short Reversals

    We consider the problem of sorting a circular permutation by super short reversals (i.e., reversals of length at most 2), a problem that finds application in comparative genomics. Polynomial-time solutions to the unsigned version of this problem are known, but the signed version remained open. In this paper, we present the first polynomial-time solution to the signed version of this problem. Moreover, we perform experiments for inferring phylogenies of two different groups of bacterial species and compare our results with the phylogenies presented in previous works. Finally, to facilitate phylogenetic studies based on the methods studied in this paper, we present a web tool for rearrangement-based phylogenetic inference using short operations, such as super short reversals.
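
    To make the operation concrete, the sketch below applies a single super short reversal to a signed circular permutation. It only illustrates the move itself, not the paper's polynomial-time sorting algorithm; the function name `super_short_reversal` and the list-based representation (last position adjacent to the first) are assumptions made for this example.

```python
from typing import List


def super_short_reversal(perm: List[int], start: int, length: int) -> List[int]:
    """Apply a signed reversal of length 1 or 2 to a circular permutation.

    `perm` is stored as a list whose last position is adjacent to position 0
    (hypothetical representation). `start` is the index of the first
    affected element.
    """
    if length not in (1, 2):
        raise ValueError("a super short reversal has length 1 or 2")
    n = len(perm)
    result = list(perm)
    if length == 1:
        # Length 1: only the orientation (sign) of one element changes.
        result[start % n] = -perm[start % n]
    else:
        # Length 2: two circularly adjacent elements swap places
        # and both change orientation.
        i, j = start % n, (start + 1) % n
        result[i], result[j] = -perm[j], -perm[i]
    return result
```

    For instance, `super_short_reversal([3, -1, 2], 2, 2)` acts on the last and first positions, which are neighbours on the circle, and yields `[-2, -1, -3]`.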

    STING Report: convenient web-based application for graphic and tabular presentations of protein sequence, structure and function descriptors from the STING database

    The STING Report is a versatile web-based application for extracting and presenting detailed information about any individual amino acid of a protein structure stored in the STING Database. The extracted information is presented as a series of GIF images and tables containing the values of up to 125 sequence/structure/function descriptors/parameters. The GIF images are generated by the Gold STING modules. The HTML page resulting from a STING Report query can be printed and, most importantly, it can be composed and visualized on a computer platform with an elementary configuration. Using the STING Report, a user can generate a collection of customized reports for amino acids of specific interest. Such a collection is an ideal match for the demand for rapid and detailed consultation and documentation of data about structure/function. Including information generated with STING Report in a research report, or even a textbook, increases the density of its contents. STING Report is freely accessible within the Gold STING Suite at http://www.cbi.cnptia.embrapa.br, http://www.es.embnet.org/SMS/, http://gibk26.bse.kyutech.ac.jp/SMS/ and http://trantor.bioc.columbia.edu/SMS (option: STING Report).

    Close 3D proximity of evolutionary breakpoints argues for the notion of spatial synteny

    Background: Folding and intermingling of chromosomes has the potential of bringing close to each other loci that are very distant genomically or even on different chromosomes. On the other hand, genomic rearrangements also play a major role in the reorganisation of loci proximities. Whether the same loci are involved in both mechanisms has been studied in the case of somatic rearrangements, but never from an evolutionary standpoint.
    Results: In this paper, we analysed the correlation between two datasets: (i) whole-genome chromatin contact data obtained in human cells using the Hi-C protocol; and (ii) a set of breakpoint regions resulting from evolutionary rearrangements which occurred since the split of the human and mouse lineages. Surprisingly, we found that two loci distant in the human genome but adjacent in the mouse genome are significantly more often observed in close proximity in the human nucleus than expected. Importantly, we show that this result holds for loci located on the same chromosome regardless of the genomic distance separating them, and the signal is stronger in gene-rich and open-chromatin regions.
    Conclusions: These findings strongly suggest that part of the 3D organisation of chromosomes may be conserved across very large evolutionary distances. To characterise this phenomenon, we propose to use the notion of spatial synteny, which generalises the notion of genomic synteny to the 3D case.

    Sampling solution traces for the problem of sorting permutations by signed reversals

    Background: Traditional algorithms for sorting by signed reversals output just one optimal solution, while the space of all optimal solutions can be huge. A so-called trace represents a group of solutions that share the same set of reversals, which must be applied to sort the original permutation following a partial ordering. Traces therefore represent the set of optimal solutions more compactly. Algorithms for enumerating the complete set of traces of solutions have been developed. However, due to their exponential complexity, their practical use is limited to small permutations. A partial enumeration of traces is a sampling of the complete set of traces and can be an alternative for studying distinct evolutionary scenarios of large permutations. Ideally, the sampling should be done uniformly from the space of all optimal solutions. This is, however, conjectured to be ♯P-complete.
    Results: We propose and evaluate three algorithms for producing a sampling of the complete set of traces that can instead be shown in practice to preserve some of the characteristics of the space of all solutions. The first algorithm (RA) constructs traces through a random selection of reversals from the list of optimal 1-sequences. The second algorithm (DFALT) is a slight modification of an algorithm that performs the complete enumeration of traces. Finally, the third algorithm (SWA) is based on a sliding window strategy to improve the enumeration of traces. All proposed algorithms were able to enumerate traces for permutations with up to 200 elements.
    Conclusions: We analysed the distribution of the enumerated traces with respect to their height and average reversal length. Various works indicate that the reversal length can be an important aspect in genome rearrangements. The algorithms RA and SWA show a tendency to lose traces with high average reversal length. Such traces are, however, rare, and qualitatively our results show that, for testable-sized permutations, the algorithms DFALT and SWA produce distributions that approximate the reversal length distributions observed with a complete enumeration of the set of traces.
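
    The claim that the space of optimal solutions can be huge can be checked by brute force on tiny inputs. The sketch below is not one of the algorithms RA, DFALT or SWA, and it does not group solutions into traces; it is a plain breadth-first search, with hypothetical names `reverse_segment` and `count_optimal_solutions`, that assumes the input is a signed permutation of 1..n given as a tuple. Because every reversal is its own inverse, shortest paths from the identity to a permutation correspond one-to-one to optimal sorting sequences.

```python
from collections import deque
from typing import Dict, Tuple

Perm = Tuple[int, ...]


def reverse_segment(perm: Perm, i: int, j: int) -> Perm:
    """Signed reversal of perm[i..j]: reverse the order and flip every sign."""
    middle = tuple(-x for x in reversed(perm[i:j + 1]))
    return perm[:i] + middle + perm[j + 1:]


def count_optimal_solutions(perm: Perm) -> Tuple[int, int]:
    """Return (reversal distance, number of optimal sorting sequences).

    Breadth-first search from the identity; the state space grows
    exponentially, so this is only usable for very small permutations.
    """
    n = len(perm)
    identity = tuple(range(1, n + 1))
    dist: Dict[Perm, int] = {identity: 0}
    paths: Dict[Perm, int] = {identity: 1}
    queue = deque([identity])
    while queue:
        current = queue.popleft()
        if current == perm:
            break  # every shortest path into `perm` has been counted
        for i in range(n):
            for j in range(i, n):
                nxt = reverse_segment(current, i, j)
                if nxt not in dist:
                    # First visit: record distance and inherit path count.
                    dist[nxt] = dist[current] + 1
                    paths[nxt] = paths[current]
                    queue.append(nxt)
                elif dist[nxt] == dist[current] + 1:
                    # Another shortest route into an already-seen state.
                    paths[nxt] += paths[current]
    return dist[perm], paths[perm]
```

    Even for the two-element permutation `(2, 1)`, `count_optimal_solutions((2, 1))` returns `(3, 6)`: three reversals are needed, and six distinct optimal sequences achieve it.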

    Enumeração de traces e Identificação de Breakpoints : Estudo de aspectos da evolução.

    The study of genome rearrangements helps biologists understand the evolution of species. By analyzing mutational events (inversions, transpositions, fissions, fusions, etc.), we seek to understand their influence on species differentiation. In this context, this work studies two distinct subjects: Traces Enumeration and Breakpoint Identification. Algorithms that solve the problem of sorting oriented permutations by reversals output only one optimal solution, although the set of solutions can be huge. The enumeration of traces of solutions for this problem allows a compact representation of the set of all optimal solutions that sort a permutation. By using this technique, biologists can study many evolutionary scenarios. We carried out a study to improve the efficiency of the enumeration algorithm by adopting a simpler data structure. Due to the exponential nature of the problem, large permutations cannot be processed in a satisfactory time. Thus, in order to produce alternative evolutionary scenarios for large permutations, we proposed and evaluated strategies for partial enumeration of traces. Breakpoints are regions that border conserved segments in the chromosomes and reflect the occurrence of evolutionary rearrangements. The techniques for breakpoint identification are meant to identify such points in the chromosome sequences. In this work, we implemented a method proposed in the literature that performs detection and refinement of breakpoints. The implementation is available as a package to other researchers. Additionally, we introduced a new methodology for breakpoint identification based on the analysis of the hit coverage observed in the alignments of intergenic sequences from the genomes of the compared species.
    Traditional algorithms for the problem of sorting signed permutations by inversions output a single solution. However, the solution space can be enormous, and the concept of traces is used to represent it more compactly. In this context, we studied algorithms for enumerating traces and propose a more efficient one. It reduces the memory consumption and the running time of the only existing algorithm by factors of 10 and 5, respectively. Despite this improvement, the time and space needed to process large permutations remain too high, so we proposed and evaluated three algorithms for sampling the optimal solutions. To study genome rearrangements, we must be able to identify these events reliably in genomes. Given a pair of genomes, conserved regions (also known as "synteny blocks") can be identified by comparing the order and direction of orthologous markers. A region located between two synteny blocks is called a breakpoint. Lemaitre et al. developed a formal method for defining and refining breakpoints using gene orthology information. We developed the software Cassis, which implements this methodology. Cassis was used to define the breakpoints of the human and mouse genomes. We aligned intergenic sequences from the two species and observed that regions inside breakpoints have lower alignment scores than regions outside them. Building on these results, we proposed a methodology for breakpoint identification that does not use orthology information. This methodology was able to identify 60% of the breakpoints found by Cassis.
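
    The idea of locating breakpoints by comparing the order and direction of orthologous markers can be sketched in a few lines. This is a simplified textbook-style illustration, not the Cassis method and not the hit-coverage methodology of the thesis; the function name `find_breakpoints` and the input convention (markers numbered 1..n by their order in genome A, listed in genome B order with a sign for strand) are assumptions for this example, and chromosome ends are ignored.

```python
from typing import List, Tuple


def find_breakpoints(order_in_b: List[int]) -> List[Tuple[int, int]]:
    """Return the pairs of consecutive markers in genome B that are not
    consecutive, with matching direction, in genome A.

    Markers are numbered 1..n by their order in genome A; a negative sign
    means the marker lies on the opposite strand in genome B.
    """
    breakpoints = []
    for left, right in zip(order_in_b, order_in_b[1:]):
        # +3,+4 and -4,-3 both keep markers 3 and 4 adjacent and collinear,
        # and in both cases right - left == 1.
        if right - left != 1:
            breakpoints.append((left, right))
    return breakpoints
```

    With `find_breakpoints([1, 2, -4, -3, 5])` the result is `[(2, -4), (-3, 5)]`: the markers fall into three conserved blocks, (1, 2), (-4, -3) and (5), separated by two breakpoints.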

    An approach to detect and remove artifacts in EST sequences

    Advisor: Zanoni Dias. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Master's degree in Computer Science.
    Abstract: Expressed Sequence Tag (EST) sequencing [2] is a technique that works with cDNA libraries. It aims to achieve a good approximation of the gene index, i.e., the list of genes present in the genome of the organism studied. Before the sequences obtained by sequencing ESTs are analyzed, they must be processed to remove artifacts. An artifact is a stretch of sequence that does not belong to the studied organism or that has low quality or low complexity. Examples of artifacts include adapters, poly-A tails and vectors. Artifacts must be removed because their presence can introduce noise into the analysis of the sequencing project data. For example, an artifact can inappropriately join two sequences in the same cluster, or separate them into two different clusters when they should be put together. Motivated by the importance of the sequence cleaning process, our main objective in this work was to develop an efficient set of procedures to detect and remove sequence artifacts. Usually, each EST sequencing project has its own set of procedures divided into many steps. These steps are, in general, linked, and the result of one step may influence the result of the next one. Our strategy was to perform each step independently, ensuring that any execution order of those steps would lead to the same result. In addition to evaluating the new strategy, this work also studied in detail two types of artifacts: low quality and slippage. For each one, algorithms were proposed and validated through tests with sequences from real sequencing projects. The final set of procedures developed in this work was evaluated using the sequences of the SUCEST project [100, 103, 113] and produced good results. The clustering obtained from sequences processed by our method has better external and internal consistency and a lower redundancy rate than the original SUCEST project clustering.
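
    As a small illustration of what removing one kind of artifact involves, the sketch below trims a poly-A tail from the end of a sequence. It is not one of the dissertation's algorithms (those address low quality and slippage and are not described in the abstract); the function name `trim_poly_a_tail` and the default threshold of 10 bases are assumptions chosen for this example.

```python
import re


def trim_poly_a_tail(sequence: str, min_run: int = 10) -> str:
    """Remove a trailing run of at least `min_run` A bases, if present.

    Matching is case-insensitive; the returned prefix keeps the
    original casing of the input.
    """
    match = re.search(r"A{%d,}$" % min_run, sequence.upper())
    return sequence[:match.start()] if match else sequence
```

    A sequence such as `"ACGT" + "A" * 12` is reduced to `"ACGT"`, while a short trailing run such as `"ACGTAAA"` is left untouched because it is below the threshold.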

    Enumeration of traces and breakpoint identification : study of evolutionary aspects

    Advisor: Zanoni Dias. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação. Doctorate in Computer Science.
    Abstract: The study of genome rearrangements helps biologists understand the evolution of species. By analyzing mutational events (inversions, transpositions, fissions, fusions, etc.), we seek to understand their influence on species differentiation. In this context, this work studies two distinct subjects: Traces Enumeration and Breakpoint Identification. Algorithms that solve the problem of sorting oriented permutations by reversals output only one optimal solution, although the set of solutions can be huge. The enumeration of traces of solutions for this problem allows a compact representation of the set of all optimal solutions that sort a permutation. By using this technique, biologists can study many evolutionary scenarios. We carried out a study to improve the efficiency of the enumeration algorithm by adopting a simpler data structure. Due to the exponential nature of the problem, large permutations cannot be processed in a satisfactory time. Thus, in order to produce alternative evolutionary scenarios for large permutations, we proposed and evaluated strategies for partial enumeration of traces. Breakpoints are regions that border conserved segments in the chromosomes and reflect the occurrence of evolutionary rearrangements. The techniques for breakpoint identification are meant to identify such points in the chromosome sequences. In this work, we implemented a method proposed in the literature that performs detection and refinement of breakpoints. The implementation is available as a package to other researchers. Additionally, we introduced a new methodology for breakpoint identification based on the analysis of the hit coverage observed in the alignments of intergenic sequences from the genomes of the compared species.

    Partial enumeration of solutions traces for the problem of sorting by signed reversals

    Traditional algorithms for sorting by signed reversals output just one optimal solution, while the space of optimal solutions can be huge. Algorithms for enumerating the complete set of solution traces were developed to support biologists' studies of alternative evolutionary scenarios. Due to the exponential complexity of these algorithms, their practical use is limited to small permutations. In this work, we propose and evaluate three different approaches to producing a partial enumeration of the complete set of traces that transform a given permutation into another one.